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ABSTRACT 

This  paper  describes  a  system  and  set  of  algorithms  for  automati¬ 
cally  inducing  stand-alone  monolingual  part-of-speech  taggers,  base 
noun-phrase  bracketers,  named-entity  taggers  and  morphological 
analyzers  for  an  arbitrary  foreign  language.  Case  studies  include 
French,  Chinese,  Czech  and  Spanish. 

Existing  text  analysis  tools  for  English  are  applied  to  bilingual 
text  corpora  and  their  output  projected  onto  the  second  language 
via  statistically  derived  word  alignments.  Simple  direct  annotation 
projection  is  quite  noisy,  however,  even  with  optimal  alignments. 
Thus  this  paper  presents  noise-robust  tagger,  bracketer  and  lemma- 
tizer  training  procedures  capable  of  accurate  system  bootstrapping 
from  noisy  and  incomplete  initial  projections. 

Performance  of  the  induced  stand-alone  part-of-speech  tagger 
applied  to  French  achieves  96%  core  part-of-speech  (POS)  tag  ac¬ 
curacy,  and  the  corresponding  induced  noun-phrase  bracketer  ex¬ 
ceeds  91%  F-measure.  The  induced  morphological  analyzer  achie¬ 
ves  over  99%  lemmatization  accuracy  on  the  complete  French  ver¬ 
bal  system. 

This  achievement  is  particularly  noteworthy  in  that  it  required 
absolutely  no  hand- annotated  training  data  in  the  given  language, 
and  virtually  no  language-specific  knowledge  or  resources  beyond 
raw  text.  Performance  also  significantly  exceeds  that  obtained  by 
direct  annotation  projection. 
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1.  TASK  OVERVIEW 

A  fundamental  roadblock  to  developing  statistical  taggers,  brack¬ 
eters  and  other  analyzers  for  many  of  the  world’s  200-1-  major  lan¬ 
guages  is  the  shortage  or  absence  of  annotated  training  data  for  the 
large  majority  of  these  languages.  Ideally,  one  would  like  to  lever- 
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Figure  1:  Projecting  part-of-speech  tags,  named-entity  tags  and 
noun-phrase  structure  from  English  to  Chinese  and  French. 


Figure  2:  French  morphological  analysis  via  English 


age  the  large  existing  investments  in  annotated  data  and  tools  for 
resource-rich  languages  (such  as  English  and  Japanese)  to  over¬ 
come  the  annotated  resource  shortage  in  other  languages. 

To  show  the  broad  potential  of  our  approach  and  methods,  this 
paper  will  investigate  four  fundamental  language  analysis  tasks: 
POS  tagging,  base  noun  phrase  (baseNP)  bracketing,  named  en¬ 
tity  tagging,  and  inflectional  morphological  analysis,  as  illustrated 
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in  Figures  1  and  2.  These  bedrock  tools  are  important  components 
of  the  language  analysis  pipelines  for  many  applications,  and  their 
low  cost  extension  to  new  languages,  as  described  here,  can  serve 
as  a  broadly  useful  enabling  resource. 

2.  BACKGROUND 

Previous  research  on  the  word  alignment  of  parallel  corpora  has 
tended  to  focus  on  their  use  in  translation  model  training  for  MT 
rather  than  on  monolingual  applications.  One  exception  is  bilin¬ 
gual  parsing.  Wu  (1995,  1997)  investigated  the  use  of  concurrent 
parsing  of  parallel  corpora  in  a  transduction  inversion  framework, 
helping  to  resolve  attachment  ambiguities  in  one  language  by  the 
coupled  parsing  state  in  the  second  language.  Jones  and  Havrilla 
(1998)  utilized  similar  joint  parsing  techniques  (twisted-pair  gram¬ 
mars)  for  word  reordering  in  target  language  generation. 

However,  with  these  exceptions  in  the  held  of  parsing,  to  our 
knowledge  no  one  has  previously  used  linguistic  annotation  pro¬ 
jection  via  aligned  bilingual  corpora  to  induce  traditional  stand¬ 
alone  monolingual  text  analyzers  in  other  languages.  Thus  both 
our  proposed  projection  and  induction  methods,  and  their  applica¬ 
tion  to  multilingual  POS  tagging,  named-entity  classihcation  and 
morphological  analysis  induction,  appears  to  be  highly  novel. 

3.  DATA  RESOURCES 

The  data  sets  used  in  these  experiments  included  the  English- 
French  Canadian  Hansards,  the  English-Chinese  Hong  Kong  Han¬ 
sards,  and  parallel  Czech-English  Reader’s  Digest  collection.  In 
addition,  multiple  versions  of  the  Bible  were  used,  including  the 
Erench  Douay-Rheims  Bible,  Spanish  Reina  Valera  Bible,  and  three 
English  Bible  Versions  (King  James,  New  International  and  Re¬ 
vised  Standard),  automatically  verse-aligned  in  multiple  pairings. 
All  corpora  were  automatically  word-aligned  by  the  now  publicly 
available  EGYPT  system  (Al-Onaizan  et  ah,  1999),  based  on  IBM’s 
Model  3  statistical  MT  formalism  (Brown  et  al.,  1990).  The  tag¬ 
ging  and  bracketing  tasks  utilized  approximately  2  million  words 
in  each  language,  with  the  sample  sizes  for  morphology  induc¬ 
tion  given  in  Table  3.  All  word  alignments  utilized  strictly  raw- 
word-based  model  variants  for  English/Erench/Spanish/Czech  and 
character-based  model  variants  for  Chinese,  with  no  use  of  mor¬ 
phological  analysis  or  stemming,  POS-tagging,  bracketing  or  dic¬ 
tionary  resources. 

4.  PART-OF-SPEECH  TAGGER  INDUCTION 

Part-of-speech  tagging  is  the  first  of  four  applications  covered  in 
this  paper.  The  goal  of  this  work  is  to  project  POS  analysis  capabil¬ 
ities  from  one  language  to  another  via  word-aligned  parallel  bilin¬ 
gual  corpora.  To  do  so,  we  use  an  existing  POS  tagger  (e.g.  Brill, 
1995)  to  annotate  the  English  side  of  the  parallel  corpus.  Then, 
as  illustrated  in  Eigure  1  for  Chinese  and  Erench,  the  raw  tags  are 
transferred  via  the  word  alignments,  yielding  an  extremely  noisy 
initial  training  set  for  the  2nd  language.  The  third  crucial  step  is  to 
generalize  from  these  noisy  projected  annotations  in  a  robust  way, 
yielding  a  stand-alone  POS  tagger  for  the  new  language  that  is  con¬ 
siderably  more  accurate  than  the  initial  projected  tags. 

Additional  details  of  this  algorithm  are  given  in  Yarowsky  and 
Ngai  (2001).  Due  to  lack  of  space,  the  following  sections  will  serve 
primarily  as  an  overview  of  the  algorithm  and  its  salient  issues. 

4.1  Part-of-speech  Projection  Issues 

First,  because  of  considerable  cross-language  differences  in  fine¬ 
grained  tag  set  inventories,  this  work  focuses  on  accurately  assign¬ 
ing  core  POS  categories  (e.g.  noun,  verb,  adverb,  adjective,  etc.), 


with  additional  distinctions  in  verb  tense,  noun  number  and  pro¬ 
noun  type  as  captured  in  the  English  tagset  inventory.  Although 
impoverished  relative  to  some  languages,  and  incapable  of  resolv¬ 
ing  details  such  as  grammatical  gender,  this  Brown-corpus-based 
tagset  granularity  is  sufficient  for  many  applications.  Eurthermore, 
many  finer-grained  part-of-speech  distinctions  are  resolved  primar¬ 
ily  by  morphology,  as  handled  in  Section  7.  Einally,  if  one  desires 
to  induce  a  finer-grained  tagging  capability  for  case,  for  example, 
one  should  project  from  a  reference  language  such  as  Czech,  where 
case  is  lexically  marked. 

Figure  3  illustrates  six  scenarios  encountered  when  projecting 
POS  tags  from  English  to  a  language  such  as  French.  The  first 
two  show  straightforward  1-to-l  projections,  which  are  encoun¬ 
tered  in  roughly  two-thirds  of  English  words.  Phrasal  (1-to-N) 
alignments  offer  greater  challenges,  as  typically  only  a  subset  of 
the  aligned  words  accept  the  English  tag.  To  distinguish  these 
cases,  we  initially  assign  position-sensitive  phrasal  parts-of-speech 
via  subscripting  (e.g.  LeVNNSa  /ow/NNSt),  and  subsequently  learn 
a  probablistic  mapping  to  core,  non-phrasal  parts  of  speech  (e.g. 
P(DT|NNSa))  that  is  used  along  with  tag  sequence  and  lexical  prior 
models  to  re-tag  these  phrasal  POS  projections. 

Tagger  Output  DT  NNS  VBG  NN 

English  The  laws  ...  ...  living  room  ... 


French  Les  lois  ...  0  ...  salon  ... 

Induced  Tags  DT  NNS  NN 

Tagger  Output  NNS  NNS  NNS  NNS 

English  Laws ...  0  Laws ...  ...  potatoes ...  ...  veterans ... 


French  Les  lois ...  Les  lois...  ...pommes  de  terre...  ...  anciens  combattants ... 

Induced  Tag  NNS  a  NNS  b  0  NNS  NNS  a  NNS  b  NNS  c  NNS  a  NNS  b 

Correct  Tag  (DT)  (NNS)  (DT)  (NNS)  (NNS)  (IN)  (NN)  (JJ)  (NNS) 

Figure  3:  French  POS  tag  projection  scenarios 

4.2  Noise-robust  POS  Tagger  Training 

Even  at  the  relatively  low  tagset  granularity  of  English,  direct 
projection  of  core  POS  tags  onto  French  achieves  only  76%  ac¬ 
curacy  using  Egypt’s  automatic  word  alignments  (as  shown  in 
Table  1).  Part  of  this  deficiency  is  due  to  word-alignment  error; 
when  word  alignments  were  manually  corrected,  direct  projection 
core-tag  accuracy  increased  to  85%.  Also,  standard  bigram  taggers 
trained  on  the  automatically  projected  data  achieve  only  modest 
success  at  generalization  (86%  when  reapplied  to  the  noisy  train¬ 
ing  data).  More  highly  lexicalized  learning  algorithms  exhibit  even 
greater  potential  for  overmodeling  the  specific  projection  errors  of 
this  data. 

Thus  our  research  has  focused  on  noise-robust  techniques  for 
distilling  a  conservative  but  effective  tagger  from  this  challenging 
raw  projection  data.  In  particular,  we  modify  standard  n-gram  mod¬ 
eling  to  separate  the  training  of  the  tag  sequence  model  P{T)  from 
the  lexical  prior  models  P(W\T),  and  apply  different  confidence 
weighting  and  signal  amplification  techniques  to  both. 

4. 2. 1  Lexical  Prior  Estimation 

Figure  4  illustrates  the  process  of  hierarchically  smoothing  the 
lexical  prior  model  P[t\w).  One  motivating  empirical  observation 
is  that  words  in  French,  English  and  Czech  have  a  strong  tendency 
to  exhibit  only  a  single  core  POS  tag  (e.g.  N  or  V),  and  very  rarely 
have  more  than  2.  In  English,  with  relatively  high  P(POS|w)  am¬ 
biguity,  only  0.37%  of  the  tokens  in  the  Brown  Corpus  are  not  cov¬ 
ered  by  a  word  type’s  two  most  frequent  core  tags,  and  in  Erench 
the  percentage  of  tokens  is  only  0.03%.  Thus  we  employ  an  ag- 
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(a)  Direct  transfer  (on  auto-aligned  data) 

N/A 

N/A 

(b)  Direct  transfer  (on  hand-aligned  data) 

N/A 

N/A 

(c)  Standard  bigram  model  (on  auto-aligned  data) 

.86 

.82 

.82 

(d)  Noise-robust  bigram  induction  (on  auto-aligned  data) 

.93 

.94 

(e)  Fully  supervised  bigram  training  (on  goldstandard) 

.97 

.96 

.98 

.97 

Table  1:  Evaluation  of  5  POS  tagger  induction  models  on  2  French  datasets  and  2  tagset  granularities 


gressive  re-estimation  in  favor  of  this  bias,  amplifying  the  model 
probability  of  the  majority  POS  tag,  and  reducing  or  zeroing  the 
model  probability  of  2nd  or  lower  ranked  core  tags  proportional 
to  their  relative  frequency  with  respect  to  the  majority  tag.  This 
process  is  then  applied  recursively,  similarly  amplifying  the  proba¬ 
bility  of  the  majority  subtags  within  each  core  tag.  Further  details, 
including  the  handling  of  i-to-Af  phrasal  alignment  projections,  are 
given  in  Yarowsky  and  Ngai  (2001). 
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Figure  4:  Hierarchical  smoothing  of  P{t\w)  tag  probabilities 


4.2.2  Tag  Sequence  Model  Estimation 

In  contrast,  the  training  of  the  tag  sequence  model  ...) 

focuses  on  confidence  weighting  and  filtering  of  projected  training 
subsequences.  The  contribution  of  each  candidate  training  sentence 
is  weighted  proportionally  with  both  its  EGYPT/GIZA  sentence- 
level  alignment  score  and  an  agreement  measure  between  the  pro¬ 
jected  tags  and  the  1st  iteration  lexical  priors,  a  rough  measure 
of  alignment  reasonableness.  Given  the  observed  bursty  distri¬ 
bution  of  alignment  errors  in  the  corpus,  this  downweighting  of 
low-confidence  alignment  regions  substantially  improves  sequence 
model  quality  with  tolerable  reduction  in  training  volume. 

4.3  Evaluation  of  POS  Tagger  Induction 

As  shown  in  Table  1,  performance  is  evaluated  on  two  evalua¬ 
tion  data  sets,  including  an  independent  200K-word  hand-tagged 
French  dataset  provided  by  Universite  de  Montreal,  which  is  used 
to  gauge  stand-alone  tagger  performance.  Signal  amplification  and 
noise  reduction  techniques  yield  a  7 1  %  error  reduction,  achieving  a 
core  tagset  accuracy  of  96%,  closely  approaching  the  upper-bound 
97%  performance  of  an  equivalent  bigram  model  trained  directly 
on  an  80%  subset  of  the  hand-tagged  evaluation  set  (using  5-fold 
cross-validation).  Thus  robust  training  on  500K  words  of  very 
noisy  but  automatically-derived  tag  projections  can  approach  the 
performance  obtained  by  fully  supervised  learning  on  80K  words 
of  hand-tagged  training  data. 


5.  NOUN  PHRASE  BRACKETER 
INDUCTION 

Our  empirical  studies  show  that  there  is  a  very  strong  tendency 
for  noun  phrases  to  cohere  as  a  unit  when  translated  between  lan¬ 
guages,  even  when  undergoing  significant  internal  re-ordering.  This 
strong  noun-phrase  cohesion  even  tends  to  hold  for  relatively  free 
word  order  languages  such  as  Czech,  where  both  native  speakers 
and  parallel  corpus  data  indicate  that  nominal  modifiers  tend  to  re¬ 
main  in  the  same  contiguous  chunk  as  the  nouns  they  modify.  This 
property  allows  collective  word  alignments  to  serve  as  a  reliable 
basis  for  bracket  projection  as  well. 

5.1  BaseNP  Projection  Methodology 

The  projection  process  begins  by  automatically  tagging  and  brac¬ 
keting  the  English  data,  using  Brill  (1995)  and  Ramshaw  &  Marcus 
(1994),  respectively. 

As  illustrated  in  Figure  5,  each  word  within  an  English  noun 
phrase  is  then  subscripted  with  the  number  of  its  NP  in  the  sentence, 
and  this  subscript  is  projected  onto  the  aligned  French  (or  Chinese) 
words.  In  the  most  common  case,  the  corresponding  French/Chinese 
noun  phrase  is  simply  the  maximal  span  of  the  projected  subscript. 

Figure  6  shows  some  of  the  projection  challenges  encountered. 
Nearly  all  such  cases  of  interwoven  projected  NPs  are  due  to  align¬ 
ment  errors,  and  a  strong  inductive  bias  towards  NP  cohesion  was 
utilized  to  resolve  these  incompatible  projections. 


Figure  5:  Standard  NP  projection  scenarios. 

[DTi  Ji  Nl]  VBD  [N2  Nz] 

[DT(i)  N(i)]  VBD  {J(i)  de 


Figure  6:  Problematic  NP  projection  scenarios. 


5.2  BaseNP  Training  Algorithm 

For  stand-alone  tool  development,  the  Ramshaw  &  Marcus  lOB 
bracketing  framework  and  a  fast  transformation-based  learning  sys¬ 
tem  (Ngai  and  Florian,  2001)  were  applied  to  the  noisy  baseNP- 
projected  data  described  above. 

As  with  POS  tagger  induction,  bracketer  induction  is  improved 
by  focusing  training  on  the  highest  quality  projected  data  and  ex¬ 
cluding  regions  with  the  strongest  indications  of  word-alignment 
error.  Thus  sentences  with  the  lowest  25%  of  model-3  alignment 
scores  were  excluded  from  training,  as  were  sentences  where  pro¬ 
jected  bracketings  overlapped  and  conflicted  (also  an  indicator  of 


alignment  errors).  Data  with  lower-confidence  POS  tagging  were 
not  filtered,  however,  as  this  filtering  reduces  robustness  when  the 
stand-alone  bracketers  are  applied  to  noisy  tagger  output.  Addi¬ 
tional  details  are  provided  in  Yarowsky  and  Ngai  (2001). 

Current  efforts  to  further  improve  the  quality  of  the  training  data 
include  use  of  iterative  EM  bootstrapping  techniques.  Separate  pro¬ 
jection  of  bracketings  from  aligned  parallel  data  with  a  3rd  lan¬ 
guage  also  shows  promise  for  providing  independent  supervision, 
which  can  further  help  distinguish  consensus  signal  from  noise. 

5.3  BaseNP  Projection  Evaluation 

Because  no  bracketed  evaluation  data  were  available  to  us  for 
French  or  Chinese,  a  third  party  fluent  in  these  languages  hand- 
bracketed  a  small,  held-out  40-sentence  evaluation  set  in  both  lan¬ 
guages,  using  a  set  of  bracketing  conventions  that  they  felt  were 
appropriate  for  the  languages.  Table  2  shows  the  performance  rela¬ 
tive  to  these  evaluation  sets,  as  measured  by  exact-match  bracketing 
precision  (Pr),  recall  (R)  and  F-measure  (F). 


Exact  Match 

Acceptable  Match 

Method 

Pr  1  R  1  F 

Pr  1  R  1  F 

Chinese: 


Direct  (auto) 

.26 

.58 

.36 

.48 

.58 

.51 

Direct  (hand) 

.47 

.61 

.53 

.86 

.86 

.86 

French: 


Direct  (auto) 

.43 

.48 

.45 

.60 

.58 

.59 

Direct  (hand) 

.56 

.51 

.53 

.74 

.70 

.72 

FTBL  (auto) 

.82 

.81 

.81 

.91 

.91 

.91 

Table  2:  Performance  of  BaseNP  induction  models 

It  is  important  to  note,  however,  that  many  decisions  regarding 
BaseNP  bracketing  conventions  are  essentially  arbitrary,  and  agree¬ 
ment  rates  between  additional  human  Judges  on  these  data  were 
measured  at  64%  and  80%  for  French  and  Chinese  respectively. 
Since  the  translingual  projections  are  essentially  unsupervised  and 
have  no  data  on  which  to  mimic  arbitrary  conventions,  it  is  also  rea¬ 
sonable  to  evaluate  the  degree  to  which  the  induced  bracketings  are 
deemed  acceptable  and  consistent  with  the  arbitrary  goldstandard 
(e.g.  no  crossing  brackets).  To  this  end,  an  additional  pool  of  3 
judges  were  asked  to  further  adjudicate  the  differences  between  the 
goldstandard  and  the  projection  output,  annotating  such  situations 
as  either  acceptable/compatible  or  unacceptable/incompatible. 

Overall,  these  translingual  projection  results  are  quite  encourag¬ 
ing.  For  the  Chinese,  they  are  similar  to  Wu’s  78%  precision  re¬ 
sult  for  translingual-grammar-based  NP  bracketing,  and  especially 
promising  given  that  no  word  segmentation  (only  raw  characters) 
were  used.  For  French,  the  increase  from  59%  to  91%  F-measure 
for  the  stand-alone  induced  bracketer  shows  that  the  training  algo¬ 
rithm  is  able  to  generalize  successfully  from  the  noisy  raw  projec¬ 
tion  data,  distilling  a  reasonably  accurate  (and  transferable)  model 
of  baseNP  structure  from  this  high  degree  of  noise. 

6.  NAMED  ENTITY  TAGGER  INDUCTION 

Multilingual  named  entity  tagger  induction  is  based  on  the  ex¬ 
tended  combination  of  the  part-of-speech  and  noun-phrase  brac¬ 
keting  frameworks.  The  entity  class  tags  used  for  this  study  were 
FName,  LName,  Place  and  Other  (other  entities  including  or¬ 
ganizations).  They  were  derived  from  an  anonymously  donated 
MUC-6  named  entity  tagger  applied  to  the  English  side  of  the  French- 
English  Canadian  Hansards  data. 

Initial  classification  proceeds  on  a  per-word  basis,  using  an  ag¬ 
gressively  smoothed  transitive  projection  model  similar  to  those  de¬ 


scribed  in  Section  7.  For  a  given  second-language  word  FW  and  all 
English  words  EWi  aligned  to  it: 

P(NEclasSj|FW)  =  ^P(NEclasSj|EWi)  Pa (EWijFW) 
i 

P(PLACE|Coree)  =  P(PLACE|Korea)  Pa(Korea|Coree)  -|- ... 

The  co-training-based  algorithm  given  in  Cucerzan  and  Yarowsky 
(1999)  was  then  used  to  train  a  stand-alone  named  entity  tagger 
from  the  projected  data.  Seed  words  for  this  algorithm  were  those 
French  words  that  were  both  POS-tagged  as  proper  nouns  and  had 
an  above-threshold  entity-class  confidence  from  the  lexical  projec¬ 
tion  models. 

Performance  was  measured  in  terms  of  per-word  entity-type  clas¬ 
sification  accuracy  on  the  French  Hansard  test  data,  using  the  4- 
class  inventory  listed  above.  Classification  accuracy  of  raw  tag 
projections  was  only  64%  (based  on  automatic  word  alignment). 
In  contrast,  the  stand-alone  co-training-based  tagger  trained  on  the 
projections  achieved  85%  classification  accuracy,  illustrating  its  ef- 
fectivess  at  generalization  in  the  face  of  projection  noise.  Notably, 
most  of  its  observed  errors  can  be  traced  to  entity  classification  er¬ 
rors  from  the  original  English  tagger.  In  fact,  when  evaluated  on  the 
English  translation  of  the  French  test  data  set,  the  English  tagger 
only  achieved  86%  classification  accuracy  on  this  directly  compa¬ 
rable  data  set.  It  appears  that  the  projection-induced  French  tagger 
achieves  performance  nearly  as  high  as  its  original  training  source. 
Thus  further  improvements  should  be  expected  from  higher  quality 
English  training  sources. 

7.  MORPHOLOGICAL  ANALYSIS 
INDUCTION 

Bilingual  corpora  can  also  serve  as  a  very  successful  bridge  for 
aligning  complex  inflected  word  forms  in  a  new  language  with  their 
root  forms,  even  when  their  surface  similarity  is  quite  different  or 
highly  irregular. 


Figure  7:  Direct-bridge  French  inflection/root  alignment 

As  illustrated  in  Figure  7,  the  association  between  a  French  ver¬ 
bal  inflection  (croyant)  and  its  correct  root  (croire),  rather  than  a 
similar  competitor  (croitre),  can  be  identified  by  a  single-step  tran¬ 
sitive  association  via  an  English  bridge  word  (believing).  However, 
in  the  case  of  morphology  induction,  such  direct  associations  are 
relatively  rare  given  that  inflections  in  a  second  language  tend  to  as¬ 
sociate  with  similar  tenses  in  English  while  the  singular/infinitive 
forms  tend  to  associate  with  analogous  singular/infinitive  forms, 
and  thus  croyaient  (believed)  and  its  root  croire  have  no  direct  En¬ 
glish  link  in  our  aligned  corpus. 

However,  Figure  2  (first  page)  illustrates  that  an  existing  invest¬ 
ment  in  a  lemmatizer  for  English  can  help  bridge  this  gap  by  Joining 
a  multi-step  transitive  association  croyaient^believed-^believe^ 
croire.  Figure  8  illustrates  how  this  transitive  linkage  via  English 


Figure  8:  Multi-bridge  French  inflection/root  alignment 


lemmatization  can  be  potentially  utilized  for  all  other  English  lem¬ 
mas  (such  as  THINK)  with  which  croyaient  and  croire  also  asso¬ 
ciate,  offering  greater  potential  coverage  and  robustness  via  multi¬ 
ple  bridges. 

Formally,  these  multiple  transitive  linkages  can  he  modeled  as 
shown  below,  by  summing  over  all  English  lemmas  (Eiemi )  with 
which  either  a  candidate  foreign  inflection  or  its  root  (Fwot) 
exhibit  an  alignment  in  the  parallel  corpus: 


Pmp  {^Froot\Finfl^  —  ^Froot\Fle7ni  )  Fa  {Flcrrii  \  Finfl ) 

i 


For  example: 

Fmp(croire|croyaient)  = 

Pa  (croire|BELiEVE)  Pa  (BELiEVE|croyaient)-|- 
Pa(croire|THiNK)  Pa (THiNKjcroyaient)  -|-  ... 

This  projection/bridge-hased  similarity  measure  Pmp{Froot\Finfl) 
can  be  quite  effective  on  its  own,  as  shown  in  the  MProj  only  entries 
in  Table  3  (for  multiple  parallel  corpora  in  3  different  languages), 
especially  when  restricted  to  the  highest-confidence  subset  of  the 
vocabulary  (5.2%  to  77.9%  in  these  data)  for  which  the  association 
exceeds  simple  fixed  probability  and  frequency  thresholds.  When 
estimated  using  a  1.2  million  word  subset  of  the  French  Hansards, 
for  example,  the  MProj  measure  alone  achives  98.5%  precision 
on  32.7%  of  the  inflected  French  verbs  in  the  corpus  (constitut¬ 
ing  97.6%  of  the  tokens  in  the  corpus).  Unlike  traditional  string- 
transduction-based  morphology  induction  methods  where  irregular 
verbs  pose  the  greatest  challenges,  these  typically  high-frequency 
words  are  often  the  best  modelled  data  in  the  vocabulary  making 
these  multilingual  projection  techniques  a  natural  complement  to 
existing  models. 

7.1  Trie-based  Morphology  Models 

The  high  precision  on  the  MProj -covered  subset  also  make  these 
partial  pairings  effective  training  data  for  robust  supervised  algo¬ 
rithms  that  can  generalize  the  string  transformation  behavior  to  the 
remaining  uncovered  vocabulary.  While  any  supervised  morpho¬ 
logical  analysis  technique  is  possible  here,  we  employ  a  trie-based 
modeling  technique  where  the  probability  of  a  given  stem-change 
(from  the  inventory  observed  in  the  MProj -paired  training  data)  is 
modeled  hierarchically  using  variable  suffix  context,  as  described 
in  Yarowsky  and  Wicentowski  (2000): 

P(root|inflection)  =  P{df3\da)  —  P{a  —I  /3|<5a)  = 

'^XiP{a  —I  /3\hi)  for  hi  —  suffix((,  <5q) 


For  example: 

P(commencer|commen9a)  =  ^’(fa  —I  cerjcommenfa)  = 

AoP(5a  —I  cer)  -|-  AiP(5a  — >■  cer|a)  -|-  A2P(5a  — >■  cerjfa)-!- 
-|-A3P(5a  —I  cerlnja)  -|-  A4P(5a  —I  cerjenja)  -|- ... 


P(ployer|ploie)  =  P(ie  — I  yerjploie)  = 

AoP(ie  —I  yer)  +  AiP(ie  —I  yer|e)  +  A2P(ie  —I  yer|ie)-|- 
-|-A3P(ie  —I  yerjoie)  -|-  A4P(ie  —I  yerjloie)  +  ... 

An  important  property  of  the  trie-based  models  is  their  effective¬ 
ness  at  clustering  words  that  exhibit  similar  morphological  behav¬ 
ior,  both  reducing  model  size  and  facilitating  generalization  to  pre¬ 
viously  unseen  examples.  This  property  is  illustrated  in  Figure  9, 
showing  a  sample  (inflection  —I  root)  trie  branch  for  French  verbal 
inflections,  with  suffix  histories  h=’oie\  h=’noie’,  h=’roie’,  etc. 
At  each  history  node,  the  hierarchically  smoothed  probabilities  of 
several  a  ^  /3  (inflection— iroot)  changes  are  given.  Note  that  the 
relative  probabilities  of  the  competing  analyses  ie^ir  and  ie^yer 
differ  substantially  for  diffent  suffix  histories,  and  that  there  are 
subexceptions  that  tend  to  cluster  by  affix  history.  This  allows  for 
the  successful  analysis  of  8  of  the  9  italicized  test  words  that  had 
not  been  seen  in  the  bilingual  projection  data  or  where  the  MProj 
model  yielded  no  root  candidate  above  threshold. 

root  P(a-|5|h) 


Figure  9:  Example  of  a  French  MTrie  branch,  showing  inflec¬ 
tion  —I  root  probabilities  (P(q  — i  fi\hi))  for  variable  length 
suffix  histories  (hi).  MTrie  analyses  on  test  data  are  given  in 
italics. 

Table  3  illustrates  the  performance  of  a  variety  of  morphology 
induction  models.  When  using  the  projection-based  MProj  and 
trie-based  MTrie  models  together  (with  the  latter  extending  cover¬ 
age  to  words  that  may  not  even  appear  in  the  parallel  corpus),  full 


verb  lemmatization  precision  on  the  1.2M  word  Hansard  subset  ex¬ 
ceeds  99.5%  (by  type)  and  99.9%  (by  token)  with  95.8%  coverage 
by  type  and  99.8%  coverage  by  token.  A  backoff  model  based  on 
Levenshtein-distance  and  distributional  context  similarity  handles 
the  relatively  small  percentage  of  cases  where  MProj  and  MTrie 
together  are  not  sufficiently  confident,  bringing  the  system  cov¬ 
erage  to  100%  coverage  with  a  small  drop  in  precision  to  97.9% 
(by  type)  and  99.8%  (by  token)  on  the  unrestricted  space  of  in¬ 
flected  verbs  observed  in  the  full  French  Hansards.  As  shown  in 
Section  7.3,  performance  is  strongly  correlated  with  size  of  the  ini¬ 
tial  aligned  bilingual  corpus,  with  a  larger  Hansard  subset  of  12M 
words  yielding  99.4%  precision  (by  type)  and  99.9%  precision  (by 
token).  Performance  on  Czech  is  discussed  in  Section  7.3. 


Precision 

Coverage 

Model 

Typ  1  Tok 

Typ  1  Tok 

FRENCH  Verbal  Morphology  Induction 

French  Hansards  (12M  words): 


CZECH  Verbal  Morphology  Induction 

Czech  Reader’s  Digest  (500K  words): 


MProj  only 

.915 

.993 

.152 

.805 

MProj -l-MTrie 

.916 

.917 

.893 

.975 

MProj  -l-MTrie-l-B  KM 

.878 

.913 

1.00 

1.00 

SPANISH  Verbal  Morphology  Induction 

Spanish  Bible  (300K  words)  via  1  English  Bible: 


MProj  only 

.973 

.935 

.264 

.351 

MProj -l-MTrie 

.988 

.998 

.971 

.967 

MProj  -l-MTrie-l-B  KM 

.966 

.985 

1.00 

1.00 

Spanish  Bible  (300K  words)  via  French  Bible: 


MProj  only 

.980 

.935 

.722 

.765 

MProj -l-MTrie 

.983 

.974 

.986 

.993 

MProj  -l-MTrie-l-B  KM 

.974 

.968 

1.00 

1.00 

Spanish  Bible  (300K  words)  via  3  English  Bibles: 


MProj  only 

.964 

.948 

.468 

.551 

MProj -l-MTrie 

.990 

.998 

.978 

.987 

MProj -l-MTrie 

.976 

.987 

1.00 

1.00 

Table  3:  Performance  of  full  verbal  morphological  analysis, 
including  precision/coverage  by  type/token 


7.2  Morphology  Induction  via  Aligned  Bibles 

Performance  using  even  small  parallel  corpora  (e.g.  a  120K  sub¬ 
set  of  the  French  Hansards)  still  yields  a  respectable  93.2%  (type) 
and  98.9%  (token)  precision  on  the  verb-lemmatization  test  set  for 
the  full  Hansards.  Given  that  the  Bible  is  actually  larger  (approxi¬ 
mately  300K  words,  depending  on  version  and  language)  and  avail¬ 
able  on-line  or  via  OCR  for  virtually  all  languages  (Resnik  et  al., 
2000),  we  also  conducted  several  experiments  on  Bible-based  mor¬ 
phology  induction,  further  detailed  in  Table  3. 

7.2.1  Boosting  Performance  via  Multiple 
Parallel  Translations 

Even  though  at  most  one  translation  of  the  Bible  is  typically 
available  in  a  given  foreign  language,  numerous  English  Bible  ver¬ 
sions  are  freely  available  and  a  performance  increase  can  be  achie¬ 
ved  by  simultaneously  utilizing  alignments  to  each  English  version. 
As  illustrated  in  Figure  10,  different  aligned  Bible  pairs  may  exhibit 
(or  be  missing)  different  full  or  partial  bridge  links  for  a  given  word 
(due  both  to  different  lexical  usage  and  poor  textual  parallelism  in 
some  text-regions  or  version  pairs).  However,  Pa{Froot\Eiemi) 
and  Pa{Eiemi  \  Finfl)  need  not  be  estimated  from  the  same  Bible 
pair.  Even  if  one  has  only  one  Bible  in  a  given  source  language, 
each  alignment  with  a  distinct  English  version  gives  new  bridging 
opportunities  with  no  additional  resources  needed  on  the  source 
language  side.  The  baseline  approach  (evaluated  here)  is  simply  to 
concatenate  the  different  aligned  versions  together.  While  word- 
pair  instances  translated  the  same  way  in  each  version  will  be  re¬ 
peated,  this  rather  reasonably  reflects  the  increased  confidence  in 
this  particular  alignment.  An  alternate  model  would  weight  version 
pairs  differently  based  on  the  otherwise-measured  translation  faith¬ 
fulness  and  alignment  quality  between  the  version  pairs.  Doing  so 
would  help  decrease  noise.  Increasing  from  1  to  3  English  versions 
reduces  the  type  error  rate  (at  full  coverage)  by  22%  on  Erench  and 
28%  on  Spanish  with  no  increase  in  the  source  language  resources. 


Figure  10:  Use  of  multiple  parallel  Bible  translations 


7. 2. 2  Boosting  Performance  via  Multiple 
Bridge  Languages 

Once  lemmatization  capabilities  have  been  successfully  projected 
to  a  new  language  (such  as  Erench),  this  language  can  then  serve  as 
an  additional  bridging  source  for  morphology  induction  in  a  third 
language  (such  as  Spanish),  as  illustrated  in  Figure  11.  This  can 
be  particularly  effective  if  the  two  languages  are  very  similar  (as 
in  Spanish-French)  or  if  their  available  Bible  versions  are  a  close 
translation  of  a  common  source  (e.g.  the  Latin  Vulgate  Bible).  As 
shown  in  Table  3,  using  the  previously  analyzed  French  Bible  as  a 
bridge  for  Spanish  achieves  performance  (97.4%  precision)  com- 


parable  to  the  use  of  3  parallel  English  Bible  versions. 


Figure  11:  Use  of  bridges  in  multiple  languages. 


7.3  Morphology  Induction:  Observations 

This  section  includes  additional  detail  regarding  the  morphol¬ 
ogy  induction  experiments,  supplementing  the  previous  details  and 
analyses  given  in  Section  7  and  Table  3. 

•  Performance  induction  using  the  French  Bible  as  the  bridge 
source  is  evaluated  using  the  full  test  verb  set  extracted  from 
the  French  Hansards.  The  strong  performance  when  trained 
only  using  the  Bible  illustrates  that  even  a  small  single  text 
in  a  very  different  genre  can  provide  effective  transfer  to 
modern  (conversational)  French.  While  the  observed  genre 
and  topic-sensitive  vocabulary  differs  substantially  between 
the  Bible  and  Hansards,  the  observed  inventories  of  stem 
changes  and  suffixation  actually  have  large  overlap,  as  do  the 
set  of  observed  high-frequency  irregular  verbs.  Thus  the  in¬ 
ventory  of  morphological  phenomena  seem  to  translate  better 
across  genre  than  do  lexical  choice  and  collocation  models. 

•  Over  60%  of  errors  are  due  to  gaps  in  the  candidate  rootlists. 
Currently  the  candidate  rootlists  are  derived  automatically  by 
applying  the  projected  POS  models  and  selecting  any  word 
with  the  probability  of  being  an  uninflected  verb  greater  than 
a  generous  threshold  and  also  ending  in  a  canonical  verb  suf¬ 
fix.  False  positives  are  easily  tolerated  (less  than  5%  of  errors 
are  due  to  spurious  non-root  competitors),  but  with  missing 
roots  the  algorithms  are  forced  either  to  propose  previously 
unseen  roots  or  align  to  the  closest  previously  observed  root 
candidate.  Thus  while  no  non-English  dictionary  was  used 
in  the  computation  of  these  results,  it  would  substantially 
improve  performance  to  have  a  dictionary-based  inventory 
of  potential  roots,  increasing  coverage  and  decreasing  noise 
from  competing  non-roots  and  spelling  errors. 

•  Performance  in  all  languages  has  been  significnatly  hindered 
by  low-accuracy  parallel-corpus  word-alignments  using  the 
original  Model-3  GIZA  tools.  Use  of  Och  and  Ney’s  re¬ 
cently  released  and  enhanced  GIZA-l-l-  word-alignment  mod¬ 
els  (Och  and  Ney,  2000)  should  improve  performance  for  all 
of  the  applications  studied  in  this  paper,  as  would  iterative  re¬ 
alignments  using  richer  alignment  features  (including  lemma 
and  part-of-speech)  derived  from  this  research. 

•  The  current  somewhat  lower  performance  on  Czech  is  due 
to  several  factors.  They  include  (a)  very  low  accuracy  ini¬ 
tial  word-alignments  due  to  often  non-parallel  translations 
of  the  Reader’s  Digest  sample  and  the  failure  of  the  initial 
word-alignment  models  to  handle  the  highly  inflected  Czech 


morphology,  (b)  the  small  size  of  the  Czech  parallel  cor¬ 
pus  (less  than  twice  the  length  of  the  Bible),  (c)  the  com¬ 
mon  occurrence  in  Czech  of  two  very  similar  perfective  and 
non-perfective  root  variants  (e.g.  odoldvat  and  odolat,  both 
of  which  mean  to  resist).  A  simple  monolingual  dictionary- 
derived  list  of  canonical  roots  would  resolve  ambiguity  re¬ 
garding  which  is  the  appropriate  target. 

•  Many  of  the  errors  are  due  to  all  (or  most)  inflections  of  a  sin¬ 
gle  verb  mapping  to  the  same  incorrect  root.  But  for  many 
applications  where  the  function  of  lemmatization  is  to  cluster 
equivalent  words  (e.g.  stemming  for  information  retrieval), 
the  choice  of  label  for  the  lemma  is  less  important  than  cor¬ 
rectly  linking  the  members  of  the  lemma. 


Size  of  Aligned  Corpus  (words) 


Figure  12:  Ueaming  Curves  for  French  Morphology 

•  The  learning  curves  in  Figure  12  show  the  strong  correlation 
between  performance  and  size  of  the  aligned  corpus.  Given 
that  large  quantities  of  parallel  text  currently  exist  in  trans¬ 
lation  bureau  archives  and  OCR-able  books,  not  to  mention 
the  increasing  online  availability  of  bitext  on  the  web,  the 
natural  growth  of  available  bitext  quantities  should  continue 
to  support  performance  improvement. 

•  The  system  analysis  examples  shown  in  Table  4  are  repre¬ 
sentative  of  model  performance  and  are  selected  to  illustrate 
the  range  of  encountered  phenomena.  All  system  evaluation 
is  based  on  the  task  of  selecting  the  correct  root  for  a  given 
inflection  (which  has  a  long  lexicography-based  consensus 
regarding  the  “truth”).  In  contrast,  the  descriptive  analysis  of 
any  such  pairing  is  very  theory  dependent  without  standard 
consensus.  The  “TopBridge”  column  shows  the  strongest 
English  bridge  lemma  utilized  in  mapping  (typically  one  of 
many  potential  bridge  lemmas). 

These  results  are  quite  impressive  in  that  they  are  based  on  essen¬ 
tially  no  language-specific  knowledge  of  French,  Spanish  or  Czech. 
In  addition,  the  multilingual  bridge  algorithm  is  surface-form  in¬ 
dependent,  and  can  just  as  readily  handle  obscure  infixational  or 
reduplicative  morphological  processes. 

8.  CONCLUSION 

This  paper  has  presented  a  detailed  survey  of  original  algorithms 
for  cross-language  annotation  projection  and  noise-robust  tagger 
induction,  evaluated  on  four  diverse  applications.  It  shows  how 
previous  major  investments  in  English  annotated  corpora  and  tool 
development  can  be  effectively  leveraged  across  languages,  achiev¬ 
ing  accurate  stand-alone  tool  development  in  other  languages  with¬ 
out  comparable  human  annotation  efforts.  Collectively  this  work  is 


the  most  comprehensive  existing  exploration  of  a  very  promising 
new  paradigm  for  cross-language  resource  projection. 
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Induced  Morphological  Analyses  for  CZECH 


Inflection 

Root  Out 

Analysis 

TopBridge 

bral 

brat 

al— ^at 

marry 

brala 

brat 

ala— ^  at 

accept 

brali 

brat 

ali— ^at 

marry 

byl 

byt 

yl^yt 

be 

byli 

byt 

yli— >-yt 

be 

bylo 

byt 

ylo— ^yt 

be 

chovala 

chovat 

la— ^t 

behave 

chova 

chovat 

a— ^at 

behave 

chovame 

chovat 

ame— ^at 

behave 

chodila 

chodit 

la->-t 

walk 

chodi 

chodit 

i->-it 

walk 

chodte 

chodit 

dte— ^dit 

swim 

chranila 

chranit 

la— ^t 

protect 

chranl 

chranit 

i— >-it 

protect 

couval 

couvat 

l-)-t 

back 

chce 

chtit 

ce— ^tit 

want 

chcete 

chtit 

cete— ^tit 

want 

chces 

chtit 

ces— ^tit 

want 

chci 

chtit 

ci— ^tit 

want 

chtejl 

chtit 

eji->-it 

want 

chteli 

chtit 

eli— ^it 

want 

chtelo 

chtit 

elo— ^it 

want 

Induced  Morphological  Analyses  for  SPANISH 


Inflection 

Root  Out 

Analysis 

TopBridge 

aborrecio 

aborrecer 

id— ^er 

hate 

aborrecia 

aborrecer 

ia— ^er 

hate 

aborrezco 

aborrecer 

zco— ^cer 

hate 

abrace 

abrazar 

ce— ^zar 

embrace 

abrazado 

abrazar 

ado— ^ar 

embrace 

adquiere 

adquirir 

ere— ^rir 

get 

andamos 

andar 

amos— ^ar 

walk 

andando 

andar 

ando— ^ar 

walk 

andaran 

andar 

aran— ^ar 

wander 

andaras 

andar 

aras— ^ar 

wander 

andemos 

andar 

emos— ^ar 

walk 

anden 

andar 

en— ^ar 

walk 

anduvo 

andar 

uvo— ^ar 

walk 

buscais 

buscar 

ais— ^ar 

seek 

bused 

buscar 

d— ^ar 

seek 

busque 

buscar 

que— ^car 

seek 

busque 

buscar 

que— ^car 

seek 

Induced  Morphological  Analyses  for  FRENCH 


Inflection 

Root  Out 

Analysis 

TopBridge 

abrege 

abreger 

ege— ^eger 

shorten 

abregent 

abreger 

egent— ^eger 

shorten 

abregerai 

abreger 

erai— ^er 

curtail 

achete 

acheter 

ete— ^eter 

buy 

achetent 

acheter 

etent— ^eter 

buy 

achetera 

acheter 

etera— ^eter 

buy 

advenait 

advenir 

ait— ^ir 

happen 

advenu 

advenir 

u— ^ir 

happen 

adviendrait 

advenir 

iendrait— ^enir 

happen 

advient 

advenir 

ient— ^enir 

happen 

aliene 

aliener 

ene— ^ener 

alienate 

alienent 

aliener 

enent— ^ener 

alienate 

confu 

concevoir 

9U— ^cevoir 

conceive 

crois 

croire 

s— ^re 

believe 

croyaient 

croire 

yaient— ^ire 

believe 

Table  4:  Sample  of  induced  morphological  analyses 


